Light microscopic examination of stained tissue sections cut onto slides from FFPE tissue blocks is the gold standard for tissue diagnostics. Moreover, any pathologist's diagnostic ability and expertise depends on their direct experience with common and rarer variant morphologies. Recently, deep learning methods have been shown to achieve high accuracy on such tasks. However, obtaining expert-level annotated images is expensive and time-consuming, and artificially synthesized histological images could prove highly beneficial. Here we present an approach that not only generates histological images reproducing the diagnostic morphological features of common disease, but also gives the user the ability to generate new and rare morphologies. Our approach involves developing a generative adversarial network model that synthesizes pathology images constrained by class labels. We investigated the ability of this framework to synthesize realistic prostate and colon tissue images, and assessed the utility of these images in augmenting the diagnostic ability of machine learning methods as well as their usability by a panel of experienced anatomic pathologists. Synthetic data generated by our framework performed similarly to real data when training a deep learning model for diagnosis. Pathologists were unable to distinguish real from synthetic images, and showed similar inter-observer agreement when grading prostate cancer. We extended the approach to the significantly more complex images from colon biopsies and showed that the complex microenvironment in such tissues can also be reproduced. Finally, we demonstrate the ability of a user to generate deepfake histological images via simple markup of semantic labels.
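As a concrete illustration of the class-conditioning idea, the following is a minimal sketch of a label-conditioned GAN generator in PyTorch; the layer sizes, number of tissue classes, and output resolution are illustrative assumptions, not the architecture used in this work.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Generator whose output morphology is constrained by a class label."""

    def __init__(self, z_dim=128, n_classes=4, img_channels=3):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, z_dim)   # class label -> dense vector
        self.net = nn.Sequential(
            nn.ConvTranspose2d(2 * z_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z, labels):
        # Concatenate noise with the embedded class label so the synthesized
        # image is conditioned on the requested tissue/disease class.
        cond = torch.cat([z, self.label_emb(labels)], dim=1)
        return self.net(cond.unsqueeze(-1).unsqueeze(-1))

g = ConditionalGenerator()
fake = g(torch.randn(8, 128), torch.randint(0, 4, (8,)))  # eight synthetic 32x32 patches
```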
The previous fine-grained datasets mainly focus on classification and are often captured in a controlled setup, with the camera focusing on the objects. We introduce the first Fine-Grained Vehicle Detection (FGVD) dataset in the wild, captured from a moving camera mounted on a car. It contains 5502 scene images with 210 unique fine-grained labels of multiple vehicle types organized in a three-level hierarchy. While previous classification datasets also include makes for different kinds of cars, the FGVD dataset introduces new class labels for categorizing two-wheelers, autorickshaws, and trucks. The FGVD dataset is challenging as it has vehicles in complex traffic scenarios with intra-class and inter-class variations in types, scale, pose, occlusion, and lighting conditions. Current object detectors like YOLOv5 and Faster RCNN perform poorly on our dataset due to a lack of hierarchical modeling. Along with providing baseline results for existing object detectors on the FGVD dataset, we also present the results of combining an existing detector with the recent Hierarchical Residual Network (HRN) classifier for the FGVD task. Finally, we show that FGVD vehicle images are the most challenging to classify among the fine-grained datasets.
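To make the two-stage baseline concrete, here is a hedged sketch of how an off-the-shelf detector can be combined with a hierarchical classifier such as HRN: the detector proposes vehicle boxes and each crop is assigned a three-level fine-grained label. `detector` and `hrn_classifier` are hypothetical callables standing in for trained models; this is not the released FGVD code.

```python
from PIL import Image

def detect_and_classify(image_path, detector, hrn_classifier):
    """Run a detector, then assign a three-level label to every vehicle crop."""
    image = Image.open(image_path).convert("RGB")
    results = []
    for box in detector(image):                       # (x1, y1, x2, y2) vehicle boxes
        crop = image.crop(box)
        coarse, make, model = hrn_classifier(crop)     # e.g. ("car", <make>, <model>)
        results.append({"box": box, "labels": (coarse, make, model)})
    return results
```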
Long-term OCR services aim to provide high-quality output to their users at competitive costs. It is essential to upgrade the models because of the complex data loaded by the users. The service providers encourage the users who provide data on which the OCR model fails by rewarding them based on data complexity, readability, and available budget. Hitherto, OCR works have prepared models on standard datasets without considering the end users. We propose a strategy of consistently upgrading an existing Handwritten Hindi OCR model three times on a dataset of 15 users, with a fixed budget of 4 users per iteration. For the first iteration, the model trains directly on the dataset from the first four users. For each remaining iteration, all remaining users write a page each, which the service providers then analyze to select the 4 (new) best users based on the quality of predictions on the human-readable words. The selected users write 23 more pages for upgrading the model. We upgrade the model with Curriculum Learning (CL) on the data available in the current iteration and compare it with a subset from previous iterations. The upgraded model is tested on a held-out set of one page each from all 23 users. We provide insights into our investigations on the effect of CL, user selection, and especially the data from unseen writing styles. Our work can be used for long-term OCR services in crowd-sourcing scenarios for both service providers and end users.
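A hedged sketch of the iterative upgrade loop described above follows; the helper callables (`score_user`, `write_pages`, `train_with_curriculum`), the selection rule, and the page counts are illustrative stand-ins rather than the authors' implementation.

```python
def upgrade_ocr(model, users, score_user, write_pages, train_with_curriculum,
                n_iters=3, per_iter=4, pages=23):
    used = list(users[:per_iter])            # first iteration: the first four users
    selected, seen_data = list(used), []
    for it in range(n_iters):
        if it > 0:
            # Remaining users each contribute one probe page; the next batch of
            # users is picked from how well the current model reads those pages.
            remaining = [u for u in users if u not in used]
            selected = sorted(remaining, key=lambda u: score_user(model, u),
                              reverse=True)[:per_iter]
            used += selected
        new_data = [p for u in selected for p in write_pages(u, pages)]
        # Curriculum Learning: combine a subset of earlier data with the new batch.
        model = train_with_curriculum(model, old_subset=seen_data, new_data=new_data)
        seen_data = seen_data + new_data
    return model
```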
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, in both zero-shot and finetuned setups. Most notably, LaViLa obtains absolute gains of 10.1% on the EGTEA classification and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior with increasing pre-training data and model size.
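The contrastive video-text objective referred to above can be sketched with a standard symmetric InfoNCE loss over a batch of clip and narration embeddings; the embedding dimension and temperature are illustrative, and this is not the LaViLa training code itself.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Normalize both modalities, then treat matching (clip, narration) pairs
    # as positives and all other pairs in the batch as negatives.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    targets = torch.arange(len(v), device=v.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = clip_style_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```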
We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st-pass decoder to emit an end-of-segment (EOS) signal in real time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass, and find that a novel dummy-frame injection strategy allows for simultaneously high-quality 2nd-pass results and low finalization latency. On a real-world long-form captioning task (YouTube), we achieve 2.4% relative WER and 140 ms EOS latency gains over a baseline VAD-based segmenter with the same cascaded encoder.
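One plausible reading of the dummy-frame injection strategy is sketched below: once the 1st pass emits EOS, the segment is padded with placeholder frames so the non-causal 2nd-pass encoder has the right context it needs and can finalize immediately rather than waiting for future audio. The shapes, right-context length, and zero-padding choice are assumptions, not details from the paper.

```python
import numpy as np

def finalize_with_dummy_frames(segment_frames, right_context=30, feat_dim=80):
    # Append placeholder (zero) feature frames after the EOS point so the
    # non-causal encoder's right-context requirement is satisfied at once.
    dummy = np.zeros((right_context, feat_dim), dtype=segment_frames.dtype)
    return np.concatenate([segment_frames, dummy], axis=0)

padded = finalize_with_dummy_frames(np.random.randn(200, 80).astype(np.float32))
```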
Damage to the inferior frontal gyrus (Broca's area) can cause agrammatic aphasia, wherein patients, although able to comprehend, lack the ability to form complete sentences. This inability leads to communication gaps which cause difficulties in their daily lives. The usage of assistive devices can help in mitigating these issues and enable the patients to communicate effectively. However, due to the lack of large-scale studies of linguistic deficits in aphasia, research on such assistive technology is relatively limited. In this work, we present two contributions that aim to re-initiate research and development in this field. Firstly, we propose a model that uses linguistic features from small-scale studies on aphasia patients and generates large-scale datasets of synthetic aphasic utterances from grammatically correct datasets. We show that the mean length of utterance, the noun/verb ratio, and the simple/complex sentence ratio of our synthetic datasets correspond to the reported features of aphasic speech. Further, we demonstrate how the synthetic datasets may be utilized to develop assistive devices for aphasia patients. The pre-trained T5 transformer is fine-tuned using the generated dataset to suggest 5 corrected sentences given an aphasic utterance as input. We evaluate the efficacy of the T5 model using BLEU and cosine semantic similarity scores, obtaining affirming results with a BLEU score of 0.827/1.00 and a semantic similarity of 0.904/1.00. These results provide a strong foundation for the concept that a synthetic dataset based on small-scale studies on aphasia can be used to develop effective assistive technology.
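A minimal sketch of the suggestion step, using the Hugging Face transformers API to have a fine-tuned T5 propose five corrections for an aphasic utterance; the checkpoint path and prompt format are hypothetical and may differ from the paper's setup.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("path/to/finetuned-t5")  # hypothetical path

inputs = tokenizer("correct: want water drink", return_tensors="pt")  # example aphasic input
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=5, max_length=32)
suggestions = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```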
This work introduces the novel task of Source-free Multi-target Domain Adaptation and proposes an adaptation framework comprising \textbf{Co}nsistency with \textbf{N}uclear-Norm Maximization and \textbf{Mix}Up knowledge distillation (\textit{CoNMix}) as a solution to this problem. The main motive of this work is to solve Single and Multi-target Domain Adaptation (SMTDA) for the source-free paradigm, which enforces the constraint that the labeled source data is not available during target adaptation due to various privacy-related restrictions on data sharing. The source-free approach leverages target pseudo labels, which can be noisy, to improve target adaptation. We introduce consistency between label-preserving augmentations and utilize pseudo label refinement methods to reduce noisy pseudo labels. Further, we propose novel MixUp Knowledge Distillation (MKD) for better generalization on multiple target domains using various source-free STDA models. We also show that the Vision Transformer (VT) backbone gives better feature representation with improved domain transferability and class discriminability. Our proposed framework achieves state-of-the-art (SOTA) results in various paradigms of source-free STDA and MTDA settings on popular domain adaptation datasets like Office-Home, Office-Caltech, and DomainNet. Project Page: https://sites.google.com/view/conmix-vcl
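For illustration, a batch nuclear-norm maximization term of the kind named in CoNMix can be written as below: maximizing the nuclear norm of the batch prediction matrix encourages target predictions that are both confident and diverse. This is an illustrative re-implementation, not the authors' code.

```python
import torch
import torch.nn.functional as F

def nuclear_norm_loss(logits):
    # Negative nuclear norm of the (batch, num_classes) prediction matrix,
    # normalized by batch size; minimizing it maximizes the nuclear norm.
    probs = F.softmax(logits, dim=1)
    return -torch.linalg.matrix_norm(probs, ord="nuc") / probs.shape[0]

loss = nuclear_norm_loss(torch.randn(32, 65, requires_grad=True))
loss.backward()
```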
Detecting obstacles is crucial for safe and efficient autonomous driving. To this end, we present NVRadarNet, a deep neural network (DNN) that detects dynamic obstacles and drivable free space using automotive radar sensors. The network utilizes temporally accumulated data from multiple radar sensors to detect dynamic obstacles and computes their orientation in a top-down bird's-eye view (BEV). The network also regresses drivable free space to detect unclassified obstacles. Our DNN is the first of its kind to use sparse radar signals to perform obstacle and free-space detection from radar data in real time. The network has been successfully used in our autonomous vehicles in real self-driving scenarios. It runs faster than real time on an embedded GPU and shows good generalization across geographic regions.
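The temporal accumulation into a bird's-eye-view grid can be sketched as follows; the grid extent, cell resolution, and the assumption that ego-motion compensation happens upstream are illustrative choices, not details from NVRadarNet.

```python
import numpy as np

def accumulate_radar_bev(scans, grid_size=256, cell_m=0.5):
    # scans: list of (N_i, 3) arrays of (x, y, rcs) radar returns already
    # transformed into the current ego frame (ego-motion compensation assumed).
    bev = np.zeros((grid_size, grid_size), dtype=np.float32)
    half = grid_size * cell_m / 2
    for scan in scans:
        ix = ((scan[:, 0] + half) / cell_m).astype(int)
        iy = ((scan[:, 1] + half) / cell_m).astype(int)
        ok = (ix >= 0) & (ix < grid_size) & (iy >= 0) & (iy < grid_size)
        np.add.at(bev, (iy[ok], ix[ok]), scan[ok, 2])   # accumulate RCS per cell
    return bev

bev = accumulate_radar_bev([np.random.randn(100, 3) * 20 for _ in range(4)])
```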
Vehicle detection is reasonably accurate under good illumination conditions, but suffers from poor accuracy in low-light conditions. The combined effect of low light and glare from headlights or taillights makes state-of-the-art object detection models more likely to miss vehicles. Thermal infrared images, however, are robust to changes in illumination because they are based on thermal radiation. Recently, generative adversarial networks (GANs) have been widely used in image domain transfer tasks. State-of-the-art GAN models have attempted to improve nighttime vehicle detection accuracy by converting infrared images to daytime RGB images. However, these models perform worse under nighttime conditions than under daytime conditions. This study therefore attempts to alleviate this shortcoming by proposing three different approaches, based on combinations of GAN models at two different levels, that aim to reduce the feature distribution gap between daytime and nighttime infrared images. A quantitative analysis compares the performance of the proposed models against the state of the art by testing them with state-of-the-art object detection models. Both the quantitative and qualitative analyses show that the proposed models outperform state-of-the-art GAN models for vehicle detection under nighttime conditions, demonstrating their efficacy.
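The evaluation pipeline implied above, translating a nighttime infrared frame to a daytime-style RGB image before running an off-the-shelf detector, can be sketched as follows; `generator` and `detector` are placeholders for models trained elsewhere, not this study's implementation.

```python
import torch

def detect_after_translation(ir_frame, generator, detector):
    # Translate the infrared frame (1, H, W) into a fake daytime RGB image,
    # then hand the translated image to a standard object detector.
    with torch.no_grad():
        rgb_like = generator(ir_frame.unsqueeze(0))
    return detector(rgb_like)
```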
Digital art is an artistic practice that uses digital technology as part of the generative or creative process. With the advent of digital currencies and NFTs (non-fungible tokens), the demand for digital art is growing actively. In this manuscript, we advocate the concept of using deep generative networks with adversarial training for stable and variant art generation. This work mainly focuses on the use of Deep Convolutional Generative Adversarial Networks (DC-GAN) and explores techniques for addressing common pitfalls in GAN training. We compare various architectures and designs of DC-GANs to arrive at recommended design choices for stable and realistic generation. The main focus of this work is to generate realistic images that do not exist in reality but are synthesized from random noise by the proposed model. We provide visual results of generated animal face images (some showing evidence of species mixtures) along with recommendations for training, architecture, and design choices. We also show how training image preprocessing plays an important role in GAN training.
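As an example of the training-image preprocessing the text refers to, a typical DC-GAN pipeline resizes, crops, and normalizes inputs to [-1, 1] to match the generator's Tanh output; the sizes below are illustrative, not the settings used in this work.

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor(),                        # scales pixels to [0, 1]
    transforms.Normalize([0.5] * 3, [0.5] * 3),   # -> [-1, 1], matching Tanh output
])
```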